Introduction to Tectonic

Storage systems—specialization vs. generality#

Over the years, organizations have built large distributed storage systems to meet their evolving needs. Such systems are often optimized for specific use cases and might not be a good fit for a general storage workload. The operational complexity of evolving and maintaining many storage systems takes its toll in terms of monetary cost and potential duplicate work. As operational experience with specialized systems builds, system designers often get new insights on how they could use a single generalized system to meet the needs of many use cases.

Note: In system design, we often start with a specialized system optimized for a specific use case. Over time, it might be possible to consolidate many such specialized systems into one general system, until some new use case arrives that the general system cannot meet. Design activity thus swings like a pendulum between specialized and general systems.

Systems often swing between specialized and generalized designs over time

The Facebook service is a canonical example: its data needs are diverse in terms of workload, and its overall data size is huge and growing. Below, we'll discuss Facebook's storage systems to better understand specialization versus generalization in the context of storage.

Global data size in exabytes, by year (2009–2020): growth in one decade is about 50-fold (Source: IDC's Digital Universe Study, sponsored by EMC, December 2012)

Facebook: From a constellation of storage systems to Tectonic#

Facebook serves numerous tenants, with hundreds of use cases/applications per tenant and a wide variety of storage needs. Blob storage and data warehousing are two major storage applications with very different workload characteristics.

For a blob store, data access patterns change over time. Some proportion of the data is heavily accessed, and such a workload needs a substantial number of input/output operations per second (IOPS) to serve clients well. Over time, as new hot data comes in, older data cools off and receives fewer read/write requests. This cooler data needs far fewer IOPS but an ever-growing amount of storage.
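To make the hot/warm distinction concrete, here is a minimal sketch that classifies blobs by their recent request rate. The one-day request counter and the threshold of 100 requests are invented for illustration; they are not how Facebook's systems actually classify data.

```python
from dataclasses import dataclass

@dataclass
class Blob:
    blob_id: str
    requests_last_day: int  # assumed access metric for this sketch

def tier(blob: Blob, hot_threshold: int = 100) -> str:
    """Classify a blob as hot (IOPS-bound) or warm (capacity-bound)."""
    return "hot" if blob.requests_last_day >= hot_threshold else "warm"

print(tier(Blob("new_photo", 5_000)))  # hot: needs lots of IOPS
print(tier(Blob("old_photo", 3)))      # warm: needs cheap, growing capacity
```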

Facebook had a specialized store (Haystack) for hot data and another system (f4) to which less frequently accessed data was moved. To meet the evolving needs of hot data in Haystack, a large number of storage nodes/disks were commissioned to satisfy the IOPS requirements. Since each disk provides a limited number of IOPS, the required disk count kept increasing. However, these disks were not fully utilized in terms of storage and had a lot of excess capacity.

On the other hand, the f4 system was bottlenecked on storage capacity, while the IOPS needs were nominal. One might wish that Haystack could utilize the excess IOPS of the f4 system and that f4 could utilize the excess storage capacity of Haystack. However, since these are independent systems, there was no provision for such resource sharing, and costly resources were being stranded.

Disks' storage capacities grew steadily over time while IOPS per disk essentially stayed the same. This means that IOPS per terabyte has declined over time, a worrying trend for IOPS-bound applications (like a blob store).
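A quick back-of-the-envelope calculation shows why this trend hurts. The capacities and IOPS figures below are assumed round numbers for a spinning disk, not measurements from any real fleet:

```python
# (year, capacity in TB, IOPS per disk): capacity grows, IOPS stays flat.
disks = [
    (2010, 1, 100),
    (2015, 4, 100),
    (2020, 16, 100),
]

for year, capacity_tb, iops in disks:
    print(f"{year}: {iops / capacity_tb:.1f} IOPS per TB")

# Output:
# 2010: 100.0 IOPS per TB
# 2015: 25.0 IOPS per TB
# 2020: 6.2 IOPS per TB
```

Every added terabyte brings fewer IOPS along with it, so an IOPS-bound application ends up buying disks for their spindles rather than their space.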

The interaction between Haystack and f4 (a seven-step walkthrough):

1. Many read and write requests for frequently accessed data arrive at the system.
2. These requests are sent to Haystack.
3. Less frequently used data is moved from Haystack to f4.
4. A few read and write requests for less frequently accessed data arrive at the system.
5. These requests are sent to f4.
6. Moving data to f4, combined with read/write requests from clients, causes an IOPS bottleneck in Haystack.
7. Retrieving data from f4, combined with read/write requests from clients, causes a storage bottleneck at f4.

As a second example application, data warehousing not only needs an enormous amount of storage capacity but also the ability to crunch that data to extract business intelligence. Facebook was using multiple HDFS clusters in a federated fashion. A single HDFS cluster can scale from many terabytes to a few petabytes; this is not enough for the warehousing application, so multiple HDFS clusters were used with data divided among them. Clients had to keep track of which HDFS cluster their data resided on. Warehouse data needs are approaching multiple exabytes, and the federated strategy is not only operationally complex but also hard to scale.

Note: Carefully assigning data to multiple HDFS clusters so that our needs are met while making efficient use of the clusters' capacity and available throughput is an instance of the two-dimensional bin-packing problem (which is NP-hard).
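To see why, consider the following first-fit sketch, where each dataset consumes two resources (storage and IOPS) and each HDFS cluster is a bin with a fixed budget of both. All names and numbers are made up for illustration:

```python
# Each cluster is a bin with two budgets: storage (TB) and throughput (IOPS).
clusters = [{"free_tb": 1000, "free_iops": 50_000} for _ in range(3)]

datasets = [
    {"name": "logs",     "tb": 600, "iops": 10_000},
    {"name": "clicks",   "tb": 300, "iops": 45_000},
    {"name": "profiles", "tb": 500, "iops": 20_000},
    {"name": "media",    "tb": 700, "iops": 5_000},
]

def place(dataset, clusters):
    """First-fit: use the first cluster with room on BOTH dimensions."""
    for i, c in enumerate(clusters):
        if c["free_tb"] >= dataset["tb"] and c["free_iops"] >= dataset["iops"]:
            c["free_tb"] -= dataset["tb"]
            c["free_iops"] -= dataset["iops"]
            return i
    return None  # nothing fits: operators must add or rebalance clusters

for d in datasets:
    print(d["name"], "-> cluster", place(d, clusters))
```

In this toy run, the clicks dataset skips cluster 0 despite cluster 0's spare storage because cluster 0 lacks the IOPS; that stranded capacity is exactly the inefficiency described above. First-fit is only a heuristic, and finding an optimal packing is NP-hard.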

The two examples above highlight the problems that arise with specialized storage systems. Facebook's answer to these challenges was a new, general storage system, Tectonic, that provides a common storage layer where resources are well utilized, while applications remain performance-isolated from each other.

Our needs#

Our system is based on the following functional and non-functional requirements.

Functional requirements#

Following are our primary functional requirements:

  • Tectonic should be able to provide multiple exabytes of storage capacity to its tenants, and this storage should be horizontally scalable.

  • Tectonic should be able to utilize the storage resources well by sharing them with all the tenants.

  • Tectonic should provide configuration knobs so that applications can tune specific aspects of the storage system for their own optimizations. An example of such a knob is the ability to choose either full data replication or Reed-Solomon codes for fault tolerance, as sketched below.
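The sketch below illustrates what such a knob could look like. The class and field names are hypothetical, invented for this example rather than taken from Tectonic's configuration API; RS(9, 6) is shown as one possible Reed-Solomon encoding.

```python
from dataclasses import dataclass

@dataclass
class DurabilityConfig:
    scheme: str        # "replicated" or "reed_solomon"
    replicas: int = 3  # used when scheme == "replicated"
    rs_data: int = 9   # RS(9, 6): 9 data chunks ...
    rs_parity: int = 6 # ... plus 6 parity chunks

    def storage_overhead(self) -> float:
        """Raw bytes stored per logical byte."""
        if self.scheme == "replicated":
            return float(self.replicas)
        return (self.rs_data + self.rs_parity) / self.rs_data

# A hot, small-object workload might pick replication for fast, simple reads;
# a cooler, large-object workload might pick Reed-Solomon to save space.
print(DurabilityConfig("replicated").storage_overhead())    # 3.0
print(DurabilityConfig("reed_solomon").storage_overhead())  # ~1.67
```

Replication triples raw storage but keeps reads simple, while Reed-Solomon cuts the overhead to about 1.67x at the cost of more complex reads and repairs.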

Non-functional requirements#

Following are our non-functional requirements:

  • Tectonic should make many thousands of IOPS available and allow IOPS to scale horizontally over time.

  • Tectonic should ensure performance isolation between applications so that sharing resources does not negatively impact them; one common mechanism, per-tenant rate limiting, is sketched after this list.

  • Tectonic should be highly available because many applications will rely on it for their storage needs.

  • Tectonic should provide other usual desirable properties from such a large distributed system, such as fault tolerance, maintainability, etc.
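One common mechanism for performance isolation is per-tenant rate limiting. The token-bucket sketch below is a generic illustration of that idea, not Tectonic's actual mechanism:

```python
import time

class TokenBucket:
    """Each tenant gets a bucket sized to its provisioned share of IOPS."""

    def __init__(self, rate_iops: float, burst: float):
        self.rate = rate_iops  # tokens (I/O operations) refilled per second
        self.capacity = burst  # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def try_acquire(self, n: int = 1) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # over its share: the request is throttled or queued

blob_store = TokenBucket(rate_iops=10_000, burst=2_000)  # assumed shares
warehouse  = TokenBucket(rate_iops=4_000, burst=500)
print(blob_store.try_acquire())  # True: within the blob store's share
```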

High-level design#

Tectonic will primarily be a within-data-center file system running on a cluster of servers. A typical cluster can span from hundreds to thousands of servers. A Tectonic cluster consists of three major types of components: a Metadata Store, a Chunk Store, and some stateless background services. The high-level architecture is shown in the illustration below (we'll discuss every component of the architecture in detail in the coming lessons).

The architecture of Tectonic: a Client Library through which applications access the system; a Metadata Store consisting of a Name layer, a File layer, and a Block layer built on a key-value store; a Chunk Store; and stateless background services (garbage collector, rebalancer, stat service, disk inventory, block repair/scan, and storage node health checker)

  • Client applications use a Client Library through which end users perform file and data operations.

  • The Metadata Store consists of stateless metadata services and a scalable key-value store, and it builds the file system logic on top of the key-value store (see the sketch after this list).

  • The stateless background services take care of tasks such as garbage collection, data rebalancing, disk inventory, block repair/scan, statistics reporting, and storage node health checking, keeping the cluster efficient and reliable.

  • The Chunk Store is a collection of storage nodes that maps data onto chunks and places them on hard disks; data is read and written in units of chunks.
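The sketch below shows, schematically, how the three metadata layers can be expressed as mappings in a flat key-value store. The key and value encodings are simplified stand-ins for illustration, not Tectonic's actual schema:

```python
kv = {}

# Name layer: (directory, name) -> file ID
kv[("name", "dir:photos", "2021.jpg")] = "file:42"

# File layer: file ID -> ordered list of block IDs
kv[("file", "file:42")] = ["block:7", "block:8"]

# Block layer: block ID -> chunk locations on storage nodes
kv[("block", "block:7")] = ["node17:/disk3/chunk_a1", "node52:/disk1/chunk_a2"]
kv[("block", "block:8")] = ["node09:/disk2/chunk_b1"]

def resolve(directory: str, filename: str) -> list:
    """Walk the three layers to find a file's chunk locations."""
    file_id = kv[("name", f"dir:{directory}", filename)]
    locations = []
    for block_id in kv[("file", file_id)]:
        locations.extend(kv[("block", block_id)])
    return locations

print(resolve("photos", "2021.jpg"))
# ['node17:/disk3/chunk_a1', 'node52:/disk1/chunk_a2', 'node09:/disk2/chunk_b1']
```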

The Client Library asks the Metadata Store for metadata, such as the locations of the chunks of a requested file. The Metadata Store looks up its metadata and responds with the locations of those chunks in the Chunk Store. The client then performs data operations directly against the Chunk Store, as the following sketch shows.
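Here is a minimal sketch of that read path. MetadataStore and ChunkStore are stand-ins for the real services, and the method names are assumptions made for this example:

```python
class MetadataStore:
    def locate_chunks(self, path: str) -> list:
        # In the real system, this walks the Name, File, and Block layers.
        return ["node17:/disk3/chunk_a1", "node52:/disk1/chunk_a2"]

class ChunkStore:
    def read_chunk(self, location: str) -> bytes:
        # In the real system, this reads chunk bytes from a storage node's disk.
        return b"<bytes of " + location.encode() + b">"

def read_file(path: str) -> bytes:
    # Step 1: ask the Metadata Store where the file's chunks live.
    locations = MetadataStore().locate_chunks(path)
    # Step 2: read the chunks directly from the Chunk Store; file data
    # never flows through the Metadata Store.
    store = ChunkStore()
    return b"".join(store.read_chunk(loc) for loc in locations)

print(read_file("/photos/2021.jpg"))
```

The important design point is that file data flows directly between the client and the Chunk Store; the Metadata Store only hands out locations, which keeps it off the data path.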

Note: A single Tectonic cluster can store multiple exabytes of data efficiently and allow hundreds of clients to access it concurrently. One exabyte is 10^18 bytes, or 1,000 petabytes.

Bird’s eye view#

In the next lessons, we’ll design and evaluate Tectonic. The following concept map is a quick summary of the problem Tectonic solves and its novelties.

Concept map: Facebook filesystems. The previous systems (Haystack, f4, and HDFS) had drawbacks such as poorly managed resources and bin-packing/dataset-splitting complexity. Tectonic's components are the Client Library (communicates with the Metadata Store and performs I/O operations directly against the Chunk Store), the Metadata Store (built on hash-partitioned, linearizable, fault-tolerant ZippyDB and layered into Name, File, and Block layers), the Chunk Store (blocks as the storage unit, stored replicated or Reed-Solomon encoded), and stateless background services (garbage collection, rebalancing, disk inventory, block repair/scan, storage node health checking, handling rack drains, and publishing filesystem statistics). Its advantages include scalability, availability, durability, data integrity, efficient resource management, and multitenancy: fair sharing of resources among and within tenants plus tenant-specific optimizations for hot blobs, warm blobs, and the data warehouse.

In the next lesson, we’ll start building a Tectonic system to meet our needs. We’ll start with the ZippyDB key-value store, which is an integral part of the system.
